Detecting Text Reuse with Modified and Weighted N-grams

Authors

  • Rao Muhammad Adeel Nawab
  • Mark Stevenson
  • Paul D. Clough
Abstract

Text reuse is common in many scenarios, and documents are often based, at least in part, on existing documents. This paper reports an approach to detecting text reuse that identifies not only documents that have been reused verbatim but also cases where the original has been rewritten. The approach identifies reuse by comparing word n-grams in documents and modifies these n-grams (by substituting words with synonyms and deleting words) to identify when text has been altered. The approach is applied to a corpus of newspaper stories and found to outperform a previously reported method.
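The abstract does not give the scoring formula or the lexical resource used, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a containment-style score (the fraction of a candidate's word n-grams that also occur in the source), a hypothetical toy synonym table `SYNONYMS`, and a deletion step that drops one interior word from each source (n+1)-gram. All function names are illustrative.

```python
# Hypothetical toy synonym table (an assumption for illustration only);
# a real system would draw on a proper lexical resource.
SYNONYMS = {
    "big": {"large"},
    "large": {"big"},
    "car": {"automobile"},
    "automobile": {"car"},
}


def word_ngrams(text, n):
    """Return the list of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_ngrams(text, n):
    """Return the text's n-grams plus substituted and deleted variants."""
    grams = set(word_ngrams(text, n))
    # Substitution: replace one word of an original n-gram with a synonym.
    for gram in list(grams):
        for i, word in enumerate(gram):
            for synonym in SYNONYMS.get(word, ()):
                grams.add(gram[:i] + (synonym,) + gram[i + 1:])
    # Deletion: drop one interior word from each (n+1)-gram, which
    # yields n-grams that skip over a word of the original text.
    for gram in word_ngrams(text, n + 1):
        for i in range(1, n):
            grams.add(gram[:i] + gram[i + 1:])
    return grams


def containment(source, candidate, n=3, modified=True):
    """Fraction of distinct candidate n-grams that are found in the source."""
    candidate_grams = set(word_ngrams(candidate, n))
    if not candidate_grams:
        return 0.0
    if modified:
        source_grams = modified_ngrams(source, n)
    else:
        source_grams = set(word_ngrams(source, n))
    return len(candidate_grams & source_grams) / len(candidate_grams)


source = "the big car stopped at the old gate"
rewrite = "the large car stopped at the gate"
plain_score = containment(source, rewrite, n=3, modified=False)
modified_score = containment(source, rewrite, n=3, modified=True)
```

For this toy pair, the rewrite swaps "big" for "large" and drops "old". The unmodified trigram score is 0.4, while the modified score is 1.0, which illustrates why modified n-grams can catch rewritten text that verbatim matching misses.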

Similar resources

N-gram Overlap in Automatic Detection of Document Derivation

Establishing authenticity and independence of documents in relation to others is not a new problem, but in the era of hyper production of e-text it certainly gained even more importance. There is an increased need for automatic methods for determining originality of documents in a digital environment. The method of n-gram overlap is only one of several methods proposed by the literature and is ...

Utterance Segmentation Using Combined Approach Based on Bi-directional N-gram and Maximum Entropy

This paper proposes a new approach to segmentation of utterances into sentences using a new linguistic model based upon Maximum-entropy-weighted Bidirectional N-grams. The usual N-gram algorithm searches for sentence boundaries in a text from left to right only. Thus a candidate sentence boundary in the text is evaluated mainly with respect to its left context, without fully considering its rig...

Detecting Co-Derivative Documents in Large Text Collections

We have analyzed the SPEX algorithm by Bernstein and Zobel [1] for detecting co-derivative documents using duplicate n-grams. Though we totally agree with the claim that not using unique n-grams can greatly increase efficiency and scalability of the process of detecting co-derivative documents, we have found serious bottlenecks in the way SPEX finds the duplicate n-grams. We propose a solution ...

Detection of New Malicious Code Using N-grams Signatures

Signature-based malicious code detection is the standard technique in all commercial anti-virus software. This method can detect a virus only after the virus has appeared and caused damage. Signature-based detection performs poorly when attempting to identify new viruses. Motivated by the standard signature-based technique for detecting viruses, and a recent successful text classification metho...

Hashing and Merging Heuristics for Text Reuse Detection

This paper describes a joint software entry by King Fahd University of Petroleum & Minerals and the University of Sheffield for the text-alignment task at PAN-2014. We employ the three steps of seeding, extension and filtering for text alignment. For seeding we use character n-grams with a variant of the Rabin-Karp algorithm for multiple pattern search. We then use an elaborate merging mechanism...

Publication date: 2012